CNNs for Dachshund Detection in TensorFlow

  1. Data preparation
  2. Simple CNN
  3. ResNet
  4. EfficientNet
    4.1 Fine-tuning
    4.2 Saving and loading EfficientNet
  5. Visualising convolutional networks
    5.1 Class saliency maps
    5.2 Object localisation
    5.3 Class model visualisation
    5.4 SHAP and Lime
  6. Quantisation for deployment

We want to investigate various convolutional neural networks for the purpose of distinguishing dachshunds from other dog breeds. Our ultimate goal is to build a lightweight model and deploy it to Heroku with Flask, with a simple frontend where users can upload their own images and see the result of the classification. You can try out the finished product at https://doxie-detector.herokuapp.com (note: the initial boot-up can take a while on Heroku's free tier). The current notebook is a continuation of linear_models.ipynb, where we applied classical linear methods to obtain a baseline accuracy for our problem. By utilising principal component analysis and support vector machines we achieved 61.2% validation accuracy. We now want to see how much we can improve on this with more sophisticated models.

The plan for the rest of the notebook is as follows. We'll first briefly overview the dataset and prepare it for usage in Keras models. Then, we'll train a simple CNN from scratch, which gives us a modest improvement in terms of accuracy. To further improve on this we perform transfer learning on existing models, namely ResNet and EfficientNet. This allows us to obtain a decent accuracy of ~93.6% without much effort. We then investigate the finished model with the help of saliency maps and class activation maps. As an application of the saliency maps, we also develop a novel simple algorithm for object localisation, which does not use any additional input from the network. Finally, we'll briefly investigate two popular methods for general ML model explainability: SHAP and Lime.

1. Data preparation

Our dataset consists of 2129 pictures of dachshunds and dogs of other breeds, spread roughly evenly across the two classes. All of the pictures were scraped from various public sources such as Dog API and reddit. The only constraint for the selected images was that each should contain exactly one dog (at least partially visible). No preprocessing or pruning of any kind was done, so the dataset includes plenty of out-of-focus photos, poorly cropped or centred pictures, and photos where only a small part of the dog is visible, to name just a few issues. It is possible, for example, that a sample image doesn't show the face of the dog at all, which means that our model will have to learn to be very versatile. Moreover, some dogs appear in multiple pictures in our dataset, which we expect will add extra noise to the validation accuracy.

We split these images into a train and a validation set with an 80-20 split and resize each of them to 224x224 pixels. We do not use a proper test set, but instead investigate our final model with a few select images of family dogs (which are not present in the train or validation datasets at all).

We're training our models locally with a single NVIDIA GTX 1660 Ti (6 GB). With the following cell we can limit TensorFlow's memory usage to 4 GB in order to keep the system operational during training.
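A minimal sketch of such a cell, using TensorFlow's logical device configuration (the 4096 MB limit matches the text; everything else is standard API usage):

```python
import tensorflow as tf

# Cap TensorFlow's GPU memory usage at 4 GB so the rest of the system
# stays responsive during training.  On a machine without a GPU the
# list is empty and nothing is configured.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096)],
    )
```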

2. Simple CNN

We first train a simple sequential CNN architecture from scratch. It is based on the VGG16 network (see Configuration A in Table 1), but we reduce the size of the fully connected layers since our dataset is small and we only want to perform binary classification. To alleviate problems with high variance due to the small training set, we include regularisation in the form of some simple image augmentation steps and BatchNormalisation layers after each convolutional layer. As usual we also include 0.5 dropout for the fully connected top layers.
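A sketch of the architecture described above. The exact filter counts, the dense width and the augmentation factors here are illustrative assumptions, not the notebook's exact configuration; the overall shape (VGG-style convolutional blocks with BatchNormalization, augmentation layers as regularisation, and a 0.5-dropout fully connected top) follows the text:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_simple_cnn(input_shape=(224, 224, 3)):
    """VGG-A-inspired binary classifier with reduced dense layers."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        # Simple image augmentation (active only during training).
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.1),
        layers.RandomZoom(0.1),
    ])
    for filters in (64, 128, 256, 512):
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.BatchNormalization())  # after each conv layer
        model.add(layers.MaxPooling2D())
    model.add(layers.Flatten())
    model.add(layers.Dense(256, activation="relu"))
    model.add(layers.Dropout(0.5))  # regularise the fully connected top
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(
        # learning_rate=epsilon=0.001, as used in the text.
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3, epsilon=1e-3),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model
```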

We use the Adam optimiser with learning_rate=epsilon=0.001 for training. We see that after ~40 epochs we obtain a 70.3% accuracy, which is roughly a 9-percentage-point improvement over our baseline at the cost of a much more complicated model.

The learning curves show that after about 50 epochs the model starts overfitting, but with early stopping we recover the earlier model, which has less variance. It seems plausible that with some more regularisation (e.g. adding dropout to each convolutional block) we could keep training and obtain improvements in accuracy. On the other hand, looking at the smoothed validation accuracy it is quite possible that the final 70% accuracy is just noise and the true (generalisation) accuracy of the model is somewhat lower. In the next section we will replace the convolutional part of our model with an existing network pretrained on the ImageNet dataset.

3. ResNet

In order to improve on our sequential network, we next consider more sophisticated structures. Also, instead of training from scratch we use transfer learning with an existing network trained on the massive ImageNet dataset. Our first choice for such an architecture is ResNet, originally introduced by He et al. (2015) (our particular version of this network is tf.keras.applications.resnet_v2.ResNet50V2, which is a moderate improvement over the original ResNet50 structure). In this seminal paper the authors introduced so-called skip connections, where the outputs of a convolutional block completely bypass the immediately following block and are directly added to the input of the subsequent block. On the one hand, this has a regularising effect, allowing them to train much deeper networks than e.g. VGG, since gradients are free to travel (without decaying) through the identity mappings of the skip connections. On the other hand, the deeper layers now have access to finer spatial information (in the form of features from earlier convolutional blocks with smaller receptive fields), allowing them to learn more complex features. All in all, ResNets vastly outperform traditional convolutional networks.

Typically, when doing transfer learning, one extracts the trained bottom layers of the pretrained model and discards the final fully connected top layers. A new set of fully connected layers, customised to the particular task, is then attached to this pretrained base and only the weights of the new layers are modified during training. Since ImageNet-trained models learn a very rich set of features, we regularise our fully connected layers with a typical 0.5 dropout.
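The setup just described can be sketched as follows. The dense width (256) is an illustrative assumption; the frozen ResNet50V2 base, the discarded top and the 0.5 dropout come from the text, and the Rescaling layer performs the same [-1, 1] scaling as resnet_v2's preprocessing:

```python
import tensorflow as tf

def build_resnet_classifier(weights="imagenet"):
    """ResNet50V2 base with a new, trainable fully connected top."""
    base = tf.keras.applications.resnet_v2.ResNet50V2(
        include_top=False, weights=weights, input_shape=(224, 224, 3))
    base.trainable = False  # only the new top layers are trained
    inputs = tf.keras.Input(shape=(224, 224, 3))
    # Scale pixels to [-1, 1], equivalent to resnet_v2.preprocess_input.
    x = tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1.0)(inputs)
    x = base(x, training=False)  # keep BatchNorm statistics frozen
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.5)(x)  # the typical 0.5 dropout
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```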

With this method we're able to obtain 88.4% accuracy in roughly 15 training epochs. This is not surprising since the ImageNet dataset already contains many pictures of dogs so the model is used to detecting them. The downside of our model is that it is rather huge both in terms of disk space and memory requirements at inference time. This unfortunately makes our initial attempt unwieldy for the purpose of deploying it directly to Heroku with Flask. We could circumvent this by setting up a TFServing server, for example, but typically it's more desirable to trim the "unnecessary fat" from models that we want deployed to production.

We clearly see how the model manages to avoid overfitting with help from the dropout layers and converges to a stable state. Let's see what the model predicts for our test pictures.

All our family dogs were appropriately classified! In the next section we decrease the size of our model by using a more recent architecture tailor-made for mobile and edge devices. It contains far fewer weights, despite offering comparable (or even better) performance. Let's see whether this will be enough for our purposes.

4. EfficientNet

EfficientNet (Tan and Le, 2020) was, at the time of its introduction, somewhat of a culmination of the evolution of multiple convolutional architectures. It did not introduce new ways to arrange layers or move around the network, per se, but focused more on various aspects of optimisation that had fallen to the sidelines with new innovative CNN models appearing almost every year. The key insight was that instead of treating existing architectures as fixed, they should be scaled along various dimensions (depth, width and resolution) in a controlled manner based on the task at hand. In the paper the authors applied this technique to many existing networks, drastically reducing the number of parameters while retaining comparable performance. In particular, they introduced a series of optimised networks, EfficientNetB0 through EfficientNetB7, where the complexity grows from B0 to B7 and each network has been tuned with their methodology.

For our problem we pick one of the smaller networks, namely, B1. To further reduce the complexity, we replace the fully connected final layers that we used with ResNet by global average pooling. Global average pooling was introduced by Lin et al. as a part of their paper Network in Network (2014). This type of layer simply compresses each filter from the previous convolutional layer by computing its mean. The intuition for this was that the pooling layer retains some of the original spatial structure which would normally be lost after flattening. Moreover, such a layer has no weights to train (because we are just taking an average over each individual filter) which helps against overfitting. GAP also makes the network smaller, because usually it is the final fully connected layers of a CNN which contain the majority of the model's weights.
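A sketch of this model, assuming the standard Keras application. Note that Keras' EfficientNet variants include input rescaling internally, so raw [0, 255] images can be fed directly:

```python
import tensorflow as tf

def build_efficientnet_classifier(weights="imagenet"):
    """Frozen EfficientNetB1 base with a global-average-pooling top."""
    base = tf.keras.applications.EfficientNetB1(
        include_top=False, weights=weights, input_shape=(224, 224, 3))
    base.trainable = False
    inputs = tf.keras.Input(shape=(224, 224, 3))
    x = base(inputs, training=False)
    # GAP compresses each filter map to its mean: no weights to train,
    # and it replaces the heavy fully connected top entirely.
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```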

We obtain a good improvement over what we had with ResNet. Notice that we also train for a lot longer even though the validation accuracy quickly reaches a fairly stable level around 93%. This is because we want to attempt to further increase the accuracy by fine-tuning our model and for this it's important that the top layers have fully converged before we unfreeze any weights. If the final layers were unstable we would risk throwing the pretrained weights of the newly unfrozen layers off-balance as they would have to adjust to large backpropagated gradients.

4.1 Fine-tuning

We can see below that the EfficientNet network consists of several stages, which each contain a number of convolutional blocks (see Table 1 in the paper). We will unfreeze only the last block of the last stage and train with a much smaller learning rate of $10^{-6}$.
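A sketch of the unfreezing step, assuming the base was wrapped in a functional model as in the previous section. The "block7b" prefix follows Keras' layer-naming convention for EfficientNetB1's final block and keeping BatchNormalization layers frozen is a common precaution, not something the text prescribes:

```python
import tensorflow as tf

def unfreeze_last_block(model, base_name="efficientnetb1", prefix="block7b"):
    """Unfreeze only the last convolutional block of the last stage."""
    base = model.get_layer(base_name)
    base.trainable = True  # the outer flag must be True for inner flags to matter
    for layer in base.layers:
        # Everything outside the chosen block (and all BatchNorm layers)
        # stays frozen.
        layer.trainable = (
            layer.name.startswith(prefix)
            and not isinstance(layer, tf.keras.layers.BatchNormalization)
        )
    # Recompile with the much smaller learning rate from the text.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-6),
                  loss="binary_crossentropy", metrics=["accuracy"])
```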

We don't really see much gain in terms of accuracy, but there is a relative improvement of ~15% in the loss. However, it seems that if we were to fine-tune much further we would be at risk of overfitting. Hence, we'll call it a day and move on to investigate what our network has actually learnt. First though, let's quickly check the predictions for our test set again.

4.2 Saving and loading EfficientNet

As a brief aside, there is currently a bug in Keras' implementation of EfficientNet and the tf.keras.models.save_model function when used with custom gradients. It is still possible to load the saved model and do inference (with a bunch of warnings thrown at you), but any further training has a possibility of failing. This issue is tracked at https://github.com/tensorflow/tensorflow/issues/40166#issuecomment-756702752, where a workaround is also provided. Since we are fine-tuning our model, we have to modify the code slightly.

5. Visualising convolutional networks

Our final task is to investigate our model and to determine whether it has actually learnt what we would expect from it. There are multiple ways to do this, including saliency maps, class activation maps, class model visualisation and inspection of specific filters based on image patches with the highest activation, to name just a few. For now, we'll focus on saliency maps and class model visualisation and leave the rest for future work. We'll also show how to use saliency maps to perform crude object localisation. We verify our observations with the help of two existing methods for model explainability: SHAP values and Lime models.

5.1 Class saliency maps

Class saliency maps were introduced by Simonyan et al. (2014). The basic idea is simple: we compute the derivative of the output of our network with respect to the input. This tells us which pixels in the original image would have the largest effect on the output (i.e. the class posterior) if we were to alter them by a small amount. Since our pictures have 3 channels, we choose the max of the absolute values across the channels for each component of the derivative to get a single channel output.

Here is a simple implementation of the saliency map computation with the help of tf.GradientTape.
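A minimal sketch of that computation, assuming the model outputs a single sigmoid probability; the derivative is collapsed to one channel via the channel-wise max of absolute values, as described above:

```python
import tensorflow as tf

def saliency_map(model, image):
    """Class saliency map (Simonyan et al.): d(output)/d(input),
    reduced to a single channel and normalised to [0, 1]."""
    x = tf.convert_to_tensor(image[None, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)  # inputs are not variables, so watch them explicitly
        score = model(x)[0, 0]
    grads = tape.gradient(score, x)[0]
    # Max of absolute values across the 3 colour channels.
    sal = tf.reduce_max(tf.abs(grads), axis=-1)
    return (sal / tf.reduce_max(sal)).numpy()
```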

5.2 Object localisation

In the original paper the authors mention how saliency maps can be used as part of a more sophisticated localisation algorithm (which falls far short of modern methods, though). The idea is that by thresholding on different quantiles of the saliency map distribution (30% and 95% to be exact), we can separate regions of the image into foreground and background pixels. Then, by fitting Gaussian mixture models to both parts, we obtain a probabilistic representation for either part of the picture. It is then possible to apply existing methods in computer vision to obtain good results. We won't go further into that here, but will simply show a quick example on how to fit the Gaussian models:
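A quick sketch of that fitting step with scikit-learn's GaussianMixture; the 30% and 95% quantile thresholds come from the text, while the component counts are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_foreground_background(image, saliency, low_q=0.30, high_q=0.95):
    """Fit one GMM to 'foreground' pixels (saliency above the 95%
    quantile) and one to 'background' pixels (below the 30% quantile)."""
    lo, hi = np.quantile(saliency, [low_q, high_q])
    pixels = image.reshape(-1, 3)
    weights = saliency.reshape(-1)
    fg = GaussianMixture(n_components=2).fit(pixels[weights >= hi])
    bg = GaussianMixture(n_components=2).fit(pixels[weights <= lo])
    return fg, bg
```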

What we want to do instead is a lot simpler. We'll treat the saliency map as a mass distribution over the original image and use it to find a bounding box, which captures an optimal portion of the total mass relative to its area. More precisely, we first compute the centre of mass of the full mass distribution, located at, say, (c_x, c_y), and start fitting bounding boxes relative to it according to the following algorithm:

  1. Fix the centroid (c_x, c_y) of a given saliency map.
  2. Create a rectangle R with a relative centre (c_x, c_y) which fits the full image. By a relative centre we mean that if we adjust the width of any pair of sides of R, then (c_x, c_y) has to stay within the adjusted rectangle.
  3. By picking one direction at a time (relative to the centroid: left, right, top, bottom) create a new rectangle R' by moving the chosen side of R perpendicularly towards its centroid so that the relative mass inside R' is tol of the total mass in R (a good default value for our data seems to be around tol=0.90).
  4. Set R:=R' and continue from step 3. with the next chosen side.
  5. After adjusting all 4 sides, output the final R as R_final.

Notice that the remaining total mass is adjusted in step 4. after the assignment. With this method we obtain quite stable bounding boxes from our saliency maps, and they seem to be reasonably accurate as long as the classified object doesn't fully occupy the original image. To ensure stability, we also perform some additional smoothing in two ways: first, we run the algorithm for each possible ordering of the sides (i.e. we get 4! = 24 rectangles) and use their mean as R_final, and second, in step 3. we choose the final side length only after smoothing the curve for the change of mass (when moving that side).

We've created a simple class Rect (see rel_rect.py), which allows us to work with rectangles with a relative centre. The constructor of Rect takes 6 arguments c_x, c_y, r_w, b_h, l_w, t_h, i.e. the coordinates of the relative centre, right width of the rectangle, bottom height, left width and top height, respectively. The class then coerces the rectangle so that it is guaranteed to fit inside our initial image (of size 224x224) and exposes attributes Rect.x_left, Rect.x_right, Rect.y_top and Rect.y_bot for the coordinates of each of the sides. We can then adjust these sides separately with the methods Rect.set_rw(r_w), Rect.set_bh(b_h) etc. After each adjustment the class automatically computes new values for all the properties of the rectangle.

We need a simple function to compute the mass of the saliency map. This is nothing but the sum of the weights (which have been normalised to lie between 0 and 1) within the given rectangle. The function get_mass below takes a positional argument x, which is the full saliency map, and an optional keyword argument r, which is the rectangle whose mass we want to compute. If no rectangle is provided, we return the total mass of the saliency map.
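A sketch matching that interface; whether the right and bottom edges are inclusive is a convention of the Rect class, so the exclusive slicing here is an assumption:

```python
import numpy as np

def get_mass(x, r=None):
    """Total saliency mass inside rectangle r, or of the whole map if
    r is None.  x is a saliency map normalised to [0, 1]; r is assumed
    to expose the Rect attributes x_left, x_right, y_top and y_bot."""
    if r is None:
        return float(x.sum())
    return float(x[r.y_top:r.y_bot, r.x_left:r.x_right].sum())
```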

We can then implement the algorithm we described as follows. It consists of three functions: adjust_side, adjust_rec and fit_recs. adjust_side does the brunt of the work. For a given rectangle it adjusts the location of the given side so that the relative mass is approximately equal to tol (after smoothing). Note that this adjustment is done in place. adjust_rec then simply calls adjust_side repeatedly for a given ordering of the sides for the inputted rectangle and returns a new rectangle, which is R_final. Finally, fit_recs is responsible for constructing the saliency map for a given input image, calling adjust_rec on it for all permutations of side orderings and for smoothing the final result.

We can see the finished result below. The centroid is plotted as a red dot, each of the 24 R_final rectangles is drawn in red and finally their mean is shown in cyan. We'll see later that quite often the centroid is not located in what we would typically think of as the centre of the object (dog, in this case). However, based on our experiments, this does not in fact seem to be an issue and our final rectangles are fairly stable even if the centroid is very close to the edges. Moreover, our original reasoning for using this form of construction was to ensure a certain reasonable constraint for each rectangle, which should help with stability. We'll soon see that this is indeed what seems to have happened.

Here we see the algorithm applied to each image from our test set. It seems to provide reasonably good results except possibly when the target object occupies the whole image.

There are multiple ways to improve this algorithm. One idea would be to use np.gradient to incorporate the rate of change of mass as we adjust the sides, which would allow us to pick the threshold in a more controlled manner. We could also try to use the Gaussian mixture model for the foreground and background pixels from the last section and fit the bounding boxes based on that instead. It would also be interesting to investigate how our method would perform on a more complicated network trained on a bigger dataset (e.g. multiclass models on the ImageNet dataset).

5.3 Class model visualisation

In the same paper the authors also discuss class model visualisation. The basic idea is to again take the gradient of the output with respect to the input and then modify (with gradient ascent) the input image so as to maximise the final class score. Ideally, this procedure should show us what the network perceives as a picture with maximal probability of being a dachshund. More precisely, our learning objective is the following. Denote by $\mathscr{I}$ the input image and let $S_c(\mathscr{I})$ be its linear class score (i.e. the output before sigmoid activation, so $S_c > 0$ corresponds to $c=\text{dachshund}$). We then want to find $\mathscr{I}$, which is a solution of the optimisation problem

$$\mathop{\mathrm{argmax}}_{\mathscr{I}}S_{c}(\mathscr{I}) - \lVert\mathscr{I}\rVert_{2}^{2},$$

where the additional $\ell_{2}$-regularisation ensures that the pixel values don't blow up and the final image stays somewhat smooth (by strongly penalising outliers). We do this by initialising $\mathscr{I}=\mathbf{0}$ and performing gradient ascent. We then add the mean $\mathscr{I}_{0}$ of the training set to the solution $\mathscr{I}$ to obtain our class visualisation.
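A sketch of this gradient-ascent loop, assuming score_model outputs the pre-sigmoid linear score; the regularisation weight lam is an assumed hyperparameter, not a value from the text:

```python
import tensorflow as tf

def class_model_visualisation(score_model, steps=3000, lr=0.3, lam=0.01):
    """Gradient ascent on the linear class score, starting from a
    zero image, with l2 regularisation on the pixel values."""
    img = tf.Variable(tf.zeros((1, 224, 224, 3)))
    opt = tf.keras.optimizers.Adam(learning_rate=lr, epsilon=1e-12)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            score = score_model(img, training=True)[0, 0]
            # Minimise the negative objective: -S_c(I) + ||I||_2^2.
            loss = -score + lam * tf.reduce_sum(tf.square(img))
        opt.apply_gradients([(tape.gradient(loss, img), img)])
    # Add the training-set mean to the result before displaying it.
    return img[0].numpy()
```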

Notice that it is important to use the class score and not the final activation. To see this, consider a multiclass classifier with softmax activation. The final output of such a network for a fixed class $c$ is $e^{S_c}/\sum_{c'}e^{S_{c'}}$. As pointed out in the paper, if we were to maximise this, we could do it simply by minimising the class score for all classes $c'\neq c$, which would not imply that the final image is representative of the features of class $c$. There seems to be an additional reason, which was not mentioned in the paper. In our case we only have 2 classes so the above is not a problem a priori. However, both softmax and sigmoid functions have the same issue that as the activation saturates the derivatives vanish (which is one of the reasons why most deep NNs have pivoted to ReLU activation in hidden layers). But this is precisely what we're optimising for and so if we were to use the class posteriors then it makes sense that the training would slow down considerably. Therefore, we remove the sigmoid activation and consider the linear output of our model instead.

We run the Adam optimiser for 3000 iterations with learning rate 0.3 and epsilon $10^{-12}$. Based on our experiments, the choice of optimisation algorithm has a big effect on the visuals of the final image. Moreover, we obtain visually better results with the flag training = True since this keeps any BatchNormalisation layers active.

On the right is the pure optimised image $\mathscr{I}$ (normalised) and on the left we have the final class model $\mathscr{I}+\mathscr{I}_{0}$. The results are not as pleasing as we expected (cf. the representations that they obtain in the original paper), but it is certainly possible to see that the model seems to be focusing a lot on the shape of the ears which are very unique to dachshunds. Below we show some other class model visualisations (which might present higher-level features) for the dachshund class that were obtained with different training parameters:

Class models for dachshunds

The results for the class "other", which contains a huge variety of breeds, are less pleasing:

Class models for others

Video of backpropagation for dachshund class model:

Our complicated final image $\mathscr{I}$ obtains an extremely high linear score of ~38.9, which, after sigmoid activation, is equal to 1.0 to 16 decimal places. Notice that in the original paper the features of each particular class are more prominent in their class model visualisations. It is unclear whether this was generally true or if these examples were cherry-picked out of the 1000 classes. However, we present a few possible explanations as to why our results seem a bit different. First of all, as we pointed out above, the optimisation problem is extremely tricky. This is because we don't want to converge to just any local optimum, but one which provides us with some visible high-level features that humans can identify when looking at the picture. It is therefore difficult to find the right combination of hyperparameters to achieve this. Second, since we trained a model for a binary classification task between two very similar types of objects, it is plausible that our model focuses heavily on smaller details as opposed to what typical multiclass classifiers might do. It would be possible to determine whether this is indeed the case by examining which image patches activate which filters of the higher convolutional layers in our model. We'll leave that for future work. Another interesting avenue for investigation would be to see how the class models are affected if we keep other training parameters constant but initialise our image $\mathscr{I}$ with some mild (clamped) noise.

In the following cell we have the code we used to create the frames for the animation.

5.4 SHAP and Lime

In this section we use existing implementations of two popular methods for explaining machine learning models, namely SHAP and Lime. Both of these methods are model agnostic, which means they will work with any machine learning model. Moreover, the implementations we use can explain many data types, such as tabular, textual or image data, with little effort. The crux of this class of "explainers" is to approximate the complicated deep learning model (in this case) by another model (called the explanation model) that is easier to understand, such as a linear model or a decision tree.

The first method we'll look at is SHAP, based on the paper by Lundberg and Lee (2017). The basic idea is to extend the notion of Shapley values to the setting of complex machine learning models. These values, for each individual prediction, are composed of feature weights that describe how much (either positively or negatively) the prediction of a model with that feature included improves on one lacking that particular feature. The sum of these weights is the SHAP value. In general, computing these values exactly is extremely computationally intensive (not least because it would require refitting a model at each step): we need to figure out how much adding a specific feature to any subset of input features changes the prediction, so one has to consider the power set of the features and the permutations of the elements in each subset. The idea proposed in the paper is to approximate this simplified model (with some missing features) by taking expectations and considering $\mathbb{E}(f(z)|z_{S})$, where $f$ is the original model and $z_{S}$ denotes an input with features not in $S$ set to zero. This quantity can then be approximated by assuming that the features are independent and that the model is linear. In the case of image data the raw features are of course the original pixels, but since this number is usually intractably large, many explanation models simplify the situation by considering superpixels (neighbourhoods of pixels with similar colour values) or larger groups of pixels at once.

For neural networks, SHAP comes with two types of explanation models that accurately approximate the true SHAP values: shap.DeepExplainer and shap.GradientExplainer. Unfortunately, both of these are currently bugged with TF 2 models containing specific types of layers (e.g. global pooling) so we cannot use them. Instead we have to rely on one of the more generic explainers, shap.Explainer, which works for any model, but returns less accurate approximations and with less efficiency.

In the above explanation the colour and intensity of each rectangle shows how much it contributed to the final class prediction (i.e. 1 for a dachshund), with red signifying a positive contribution and blue a negative one. Naturally the above picture shows nothing surprising based on our experiments with saliency maps, save for the fact that the model seems to put great emphasis on the shape of the head and nose. If we were able to use one of the NN-specific explainers we might be able to obtain more insight with higher-fidelity explanations. Later we'll remedy this issue with Lime, but first let's look at a few more examples.

Notice that in the case of a negative prediction (i.e. 0 for "other") it is the blue squares that encode the features contributing to the model's decision to not classify the dog as a dachshund since we are dealing with binary classification.

In order to get more precise explanations we'll next look at Lime, based on the work of Ribeiro, Singh and Guestrin (2016). This predates SHAP somewhat and can be seen as a local approximation to the true SHAP values (see the discussion in the SHAP paper). In practice, Lime looks at the superpixels weighted by their proximity to the region of the picture it is trying to explain. It then fits a linear model, according to these weights, which locally approximates the underlying black-box model. One typical issue arising from this approach is that the definitions of local and neighbourhood depend greatly on the task, or even on the particular model at hand, and it is generally difficult to define good default values. Instead, one has to have some baseline understanding of what constitutes a reasonable explanation for the model in order to tune these parameters via trial and error. See here for further discussion. In any case, as we see below, this is rather simple for image-based models and we gain some further insight that was not evident with SHAP.

As we can see, the information provided by Lime is much more precise than that from SHAP (mainly because it relies on superpixels rather than fixed regions of the image). In the above, the green regions indicate what Lime thinks the underlying model considers important for the final prediction, whereas the red regions denote parts of the image which decrease this confidence. Interestingly, we notice that the model doesn't pay much attention to the head in this prediction (possibly since it blends in with the rest of the body), but it does think that the curve of the lower body, the tail and the shape of the ear are unique identifiers of dachshunds (which is definitely believable!). Let's now look at a negative prediction. As before, the purpose of the colours is reversed in this case since we're trying to predict the negative class.

Here we notice something interesting. The model seems to completely ignore the eyes and the nose of the dog. Perhaps this is one reason why these particular features were not very visible in our class model visualisations which seemed to focus more on the texture of the fur and the curves of the body of the dog. On the other hand, we see an emphasis put on the shape of the lower part of the head, which is definitely distinct for dachshunds versus other breeds, along with the red region under the dog. It's possible that the model is looking at the legs or the paws here, but we cannot immediately conclude this based on the crude explanation above.

6. Quantisation for deployment

Now that we are satisfied with our model, it's time to deploy it to Heroku. Alas, even with all the simplifications we still end up going over the allotted RAM at inference time. In order to get around this, we quantise the model with tf.lite to use a smaller precision. A typical default target is to convert the model to use uint8, but this is only optimised to run on ARM architectures. For Heroku we instead quantise to 16-bit floats. The final inference model ends up being only 22 MB in size, whereas the full model exported with tf.keras.models.save_model is well over 100 MB. As a comparison, the ResNet model takes about 42 MB even when quantised.
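The float16 conversion can be sketched with the standard TFLite converter flags; the helper name is ours, and writing the resulting bytes to disk is left to the caller:

```python
import tensorflow as tf

def quantise_to_float16(model):
    """Post-training float16 quantisation, roughly halving model size.
    (Full uint8 quantisation would need a representative dataset and
    mainly pays off on ARM hardware.)"""
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.target_spec.supported_types = [tf.float16]
    return converter.convert()  # serialised flatbuffer bytes
```

The returned bytes can then be written out, e.g. `open("model.tflite", "wb").write(quantise_to_float16(model))`.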

In order to avoid using Heroku's finicky filesystem, we store the user-uploaded file only in memory and pass it as a binary buffer to the HTML template. The complete inference process with the quantised model looks something like the following:
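A minimal sketch with tf.lite.Interpreter, assuming a single image input and a single sigmoid output; names and shapes here are illustrative, not the app's exact code:

```python
import numpy as np
import tensorflow as tf

def predict_tflite(tflite_bytes, image):
    """Run one preprocessed image through a quantised TFLite model."""
    interpreter = tf.lite.Interpreter(model_content=tflite_bytes)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    # Add the batch dimension and feed the tensor.
    interpreter.set_tensor(inp["index"], image[None].astype(np.float32))
    interpreter.invoke()
    return float(interpreter.get_tensor(out["index"])[0, 0])
```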